18 research outputs found
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms
of a broader task, where multiple keysteps are performed in sequence across a
long video to reach a final goal state -- such as the steps of a recipe or a
DIY fix-it task. Prior work largely treats keystep recognition in isolation
from this broader structure, or else rigidly confines keysteps to align with a
predefined sequential script. We propose discovering a task graph automatically
from how-to videos to represent probabilistically how people tend to execute
keysteps, and then leverage this graph to regularize keystep recognition in
novel videos. On multiple datasets of real-world instructional videos, we show
the impact: more reliable zero-shot keystep localization and improved video
representation learning, exceeding the state of the art.
Comment: Technical Report
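As an illustrative sketch only (not the authors' implementation), the core idea of mining a probabilistic task graph and using it to regularize recognition can be reduced to estimating keystep transition probabilities from many how-to videos and blending them with per-clip scores. The function names, first-order Markov assumption, and blending weight here are our own:

```python
from collections import defaultdict

def build_task_graph(keystep_sequences):
    """Estimate transition probabilities P(next keystep | current keystep)
    from observed keystep sequences, one list per how-to video."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in keystep_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    graph = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        graph[cur] = {nxt: c / total for nxt, c in nxts.items()}
    return graph

def regularize(scores, prev_step, graph, weight=0.5):
    """Blend per-clip recognition scores with the graph's transition prior
    given the previously recognized keystep."""
    prior = graph.get(prev_step, {})
    return {step: (1 - weight) * s + weight * prior.get(step, 0.0)
            for step, s in scores.items()}
```

For example, if "whisk" usually follows "crack egg" in training videos, an ambiguous clip after "crack egg" gets its "whisk" score boosted by the prior.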
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Searching long egocentric videos with natural language queries (NLQ) has
compelling applications in augmented reality and robotics, where a fluid index
into everything that a person (agent) has seen before could augment human
memory and surface relevant information on demand. However, the structured
nature of the learning problem (free-form text query inputs, localized video
temporal window outputs) and its needle-in-a-haystack nature makes it both
technically challenging and expensive to supervise. We introduce
Narrations-as-Queries (NaQ), a data augmentation strategy that transforms
standard video-text narrations into training data for a video query
localization model. Validating our idea on the Ego4D benchmark, we find it has
tremendous impact in practice. NaQ improves multiple top models by substantial
margins (even doubling their accuracy), and yields the very best results to
date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in
the CVPR and ECCV 2022 competitions and topping the current public leaderboard.
Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique
properties of our approach such as the ability to perform zero-shot and
few-shot NLQ, and improved performance on queries about long-tail object
categories. Code and models:
http://vision.cs.utexas.edu/projects/naq
Comment: 13 pages, 7 figures, appearing in CVPR 2023
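The narrations-as-queries transformation is, at heart, a data-augmentation recipe: treat each timestamped narration as a free-form query and derive a temporal window around its timestamp. A minimal sketch, where the fixed centered window is our assumption rather than the paper's exact windowing scheme:

```python
def narrations_to_queries(narrations, window=4.0):
    """Convert timestamped narrations, e.g. ("C opens the fridge", t),
    into (query, start, end) training triples for an NLQ localizer."""
    samples = []
    for text, t in narrations:
        # Hypothetical fixed-length window centered on the narration time.
        samples.append((text, max(0.0, t - window / 2), t + window / 2))
    return samples
```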
SpotEM: Efficient Video Search for Episodic Memory
The goal in episodic memory (EM) is to search a long egocentric video to
answer a natural language query (e.g., "where did I leave my purse?"). Existing
EM methods exhaustively extract expensive fixed-length clip features to look
everywhere in the video for the answer, which is infeasible for long
wearable-camera videos that span hours or even days. We propose SpotEM, an
approach to achieve efficiency for a given EM method while maintaining good
accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that
learns to identify promising video regions to search conditioned on the
language query; 2) a set of low-cost semantic indexing features that capture
the context of rooms, objects, and interactions that suggest where to look; and
3) distillation losses that address the optimization issues arising from
end-to-end joint training of the clip selector and EM model. Our experiments on
200+ hours of video from the Ego4D EM Natural Language Queries benchmark and
three different EM models demonstrate the effectiveness of our approach:
computing only 10% - 25% of the clip features, we preserve 84% - 97% of the
original EM model's accuracy. Project page:
https://vision.cs.utexas.edu/projects/spotem
Comment: Published in ICML 2023
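The efficiency mechanism can be illustrated with a toy selector: score every clip cheaply (in SpotEM, conditioned on the query via low-cost semantic features), then run the expensive EM model only on a top-scoring budgeted subset. This sketch assumes scores are already computed and uses our own function name and budgeting rule:

```python
def select_clips(clip_scores, budget=0.25):
    """Keep only the top-scoring fraction of clips for expensive feature
    extraction; returns the selected clip indices in temporal order."""
    k = max(1, int(len(clip_scores) * budget))
    ranked = sorted(range(len(clip_scores)),
                    key=lambda i: clip_scores[i], reverse=True)
    return sorted(ranked[:k])
```

With a budget of 0.25, only a quarter of the clips ever reach the costly feature extractor, matching the 10%-25% compute regime the abstract reports.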
EgoEnv: Human-centric environment representations from egocentric video
First-person video highlights a camera-wearer's activities in the context of
their persistent environment. However, current video understanding approaches
reason over visual features from short video clips that are detached from the
underlying physical space and capture only what is immediately visible. To
facilitate human-centric environment understanding, we present an approach that
links egocentric video and the environment by learning representations that are
predictive of the camera-wearer's (potentially unseen) local surroundings. We
train such models using videos from agents in simulated 3D environments where
the environment is fully observable, and test them on human-captured real-world
videos from unseen environments. On two human-centric video tasks, we show that
models equipped with our environment-aware features consistently outperform
their counterparts with traditional clip features. Moreover, despite being
trained exclusively on simulated videos, our approach successfully handles
real-world videos from HouseTours and Ego4D, and achieves state-of-the-art
results on the Ego4D NLQ challenge. Project page:
https://vision.cs.utexas.edu/projects/ego-env/
Comment: Published in NeurIPS 2023 (Oral)
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference setting apart HM3DSEM from other
datasets is the use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the
Object Goal Navigation task using different methods. Policies trained using
HM3DSEM outperform those trained on prior datasets. The introduction of
HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation
from 400 submissions in 2021 to 1022 submissions in 2022.
Comment: 14 pages, 10 figures, 5 tables
A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems
Despite the advancement of machine learning techniques in recent years,
state-of-the-art systems lack robustness to "real world" events, where the
input distributions and tasks encountered by the deployed systems will not be
limited to the original training context, and systems will instead need to
adapt to novel distributions and tasks while deployed. This critical gap may be
addressed through the development of "Lifelong Learning" systems that are
capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3)
Scalability. Unfortunately, efforts to improve these capabilities are typically
treated as distinct areas of research that are assessed independently, without
regard to the impact of each separate capability on other aspects of the
system. We instead propose a holistic approach, using a suite of metrics and an
evaluation framework to assess Lifelong Learning in a principled way that is
agnostic to specific domains or system techniques. Through five case studies,
we show that this suite of metrics can inform the development of varied and
complex Lifelong Learning systems. We highlight how the proposed suite of
metrics quantifies performance trade-offs present during Lifelong Learning
system development - both the widely discussed Stability-Plasticity dilemma and
the newly proposed relationship between Sample Efficient and Robust Learning.
Further, we make recommendations for the formulation and use of metrics to
guide the continuing development of Lifelong Learning systems and assess their
progress in the future.
Comment: To appear in Neural Networks
Hybrid EMD-RF Model for Predicting Annual Rainfall in Kerala, India
Rainfall forecasting is critical for the economy, but it has proven difficult due to the uncertainties, complexities, and interdependencies that exist in climatic systems. An efficient rainfall forecasting model will be beneficial in implementing suitable measures against natural disasters such as floods and landslides. In this paper, a novel hybrid model of empirical mode decomposition (EMD) and random forest (RF) was developed to enhance the accuracy of annual rainfall prediction. The EMD technique was utilized to decompose the rainfall signal into six intrinsic mode functions (IMFs) to extract underlying patterns, while the RF algorithm was employed to make predictions based on the IMFs. The hybrid RF–IMF model was trained and tested using a dataset of annual rainfall in Kerala from 1871 to 2020, and its performance was compared to traditional models such as RF regression and the autoregressive moving average (ARMA) model. Mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2) were used to compare the performances of these three models. The evaluation metrics show that the RF–IMF model outperformed both the RF and ARMA models.
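The five evaluation metrics named above have standard definitions, which can be written out directly (this is a generic reference implementation, not the authors' code):

```python
import math

def mae(y, p):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def mape(y, p):
    """Mean absolute percentage error (requires nonzero targets)."""
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)

def mse(y, p):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(mse(y, p))

def r2(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```

Lower is better for the error metrics, while R2 closer to 1 indicates a better fit, which is how the three models are ranked.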
Micropropagation prospective of cotyledonary explants of Decalepis hamiltonii Wight & Arn.—An endangered edible species
The study was undertaken to standardize the development of callus, shoot and root regeneration from cotyledonary explants of Decalepis hamiltonii Wight & Arn. through tissue culture techniques. MS medium supplemented with 6-benzyl amino purine (BA), 2,4-dichlorophenoxy acetic acid (2,4-D), kinetin (Kn), gibberellic acid (GA3), indole acetic acid (IAA), indole butyric acid (IBA) and 1-naphthalene acetic acid (NAA) was used for callus, shoot and root regeneration. The maximum percentage (82.0%) of callus formation was achieved on 0.5 mg/L BA in combination with 0.05 mg/L Kn, followed by 78.5% callus formation on 0.5 mg/L 2,4-D fortified with 0.05 mg/L Kn. The highest shoot proliferation (4.6 shoots/callus) and shoot length (6.9 cm) were achieved on 1.0 mg/L BA combined with 0.1 mg/L GA3, followed by 3.8 shoots per callus and 5.8 cm shoot length on 1.0 mg/L IAA combined with 0.1 mg/L GA3. The highest root formation (38.2 roots/shoot) and root length (11.8 cm) were achieved on ½ strength MS medium fortified with 0.4 mg/L IBA, followed by 36.5 roots per shoot and a root length of 10.7 cm on 0.4 mg/L NAA. The well-developed rooted plantlets were hardened in a mixture of forest soil, soil and vermiculite (1:1:1), and 97.5% of plantlets survived after hardening.
Comparison of patient and graft survival in tacrolimus versus cyclosporine-based immunosuppressive regimes in renal transplant recipients – Single-center experience from South India
Studies have shown better graft function and reduced acute rejection rates among renal transplant recipients on tacrolimus (Tac)-based immunosuppression regimens as compared to cyclosporine (CsA)-based regimens in the first year. However, long-term follow-up data did not reveal better outcomes for the Tac-based regimens. In view of the short-term benefits, the trend of late has been to change to Tac-based regimens. Data from the Indian subcontinent are, however, sparse. We, therefore, looked at our data to ascertain whether a Tac-based regimen does have better outcomes in our population. We studied a total of 108 individuals who underwent renal transplantation between January 2007 and June 2013, with a mean follow-up of 38.22 months (comparable between the two groups). In our group, males constituted 77.8%, and among the 108 individuals, 16.7% were diabetic. New-onset diabetes after renal transplantation was significantly more common in the Tac group (21 vs. 12; P = 0.03). At the last follow-up, serum creatinine was significantly higher in the CsA group (1.77 mg/dl vs. 1.35 mg/dl; P = 0.03). The number of individuals requiring hemodialysis was also significantly higher in the CsA group (9 vs. 2; P = 0.05). Patient survival was similar in both groups at the 1-year and 5-year follow-up; however, graft survival was better in the Tac group than in the CsA group (0.94 vs. 0.88 at 1 year and 0.85 vs. 0.72 at 5 years).